# Adding a New Metric

## Before You Start

### Two different types of metrics

There are two types of metrics in Lighteval:

#### Sample-Level Metrics
- **Purpose**: Evaluate individual samples/predictions
- **Input**: Takes a `Doc` and `ModelResponse` (model's prediction)
- **Output**: Returns a float or boolean value for that specific sample
- **Example**: Checking if a model's answer matches the correct answer for one sample

#### Corpus-Level Metrics
- **Purpose**: Compute final scores across the entire dataset/corpus
- **Input**: Takes the results from all sample-level evaluations
- **Output**: Returns a single score representing overall performance
- **Examples**:
  - Simple aggregation: Calculating average accuracy across all test samples
  - Complex metrics: BLEU score where sample-level metric prepares data (tokenization, etc.) and corpus-level metric computes the actual BLEU score

### Check Existing Metrics

First, check if you can use one of the parameterized functions in
[Corpus Metrics](package_reference/metrics#corpus-metrics) or
[Sample Metrics](package_reference/metrics#sample-metrics).

If not, you can use the `custom_task` system to register your new metric.

> [!TIP]
> To see an example of a custom metric added along with a custom task, look at the [IFEval custom task](https://github.com/huggingface/lighteval/tree/main/examples/custom_tasks/ifeval).

> [!WARNING]
> To contribute your custom metric to the Lighteval repository, you would first need
> to install the required dev dependencies by running `pip install -e .[dev]`
> and then run `pre-commit install` to install the pre-commit hooks.

## Creating a Custom Metric

### Step 1: Create the Metric File

Create a new Python file which should contain the full logic of your metric.
The file also needs to start with these imports:

```python
from aenum import extend_enum
from lighteval.metrics import Metrics
```

### Step 2: Define the Sample-Level Metric

You need to define a sample-level metric. All sample-level metrics will have the same signature, taking a
[`~lighteval.types.Doc`] and a [`~lighteval.types.ModelResponse`]. The metric should return a float or a
boolean.

#### Single Metric Example

```python
def custom_metric(doc: Doc, model_response: ModelResponse) -> bool:
    response = model_response.final_text[0]
    return response == doc.choices[doc.gold_index]
```

#### Multiple Metrics Example

If you want to return multiple metrics per sample, you need to return a dictionary with the metrics as keys and the values as values:

```python
def custom_metric(doc: Doc, model_response: ModelResponse) -> dict:
    response = model_response.final_text[0]
    return {"accuracy": response == doc.choices[doc.gold_index], "other_metric": 0.5}
```

### Step 3: Define Aggregation Function (Optional)

You can define an aggregation function if needed. A common aggregation function is `np.mean`:

```python
def agg_function(items):
    flat_items = [item for sublist in items for item in sublist]
    score = sum(flat_items) / len(flat_items)
    return score
```

### Step 4: Create the Metric Object

#### Single Metric

If it's a sample-level metric, you can use the following code
with [`~metrics.utils.metric_utils.SampleLevelMetric`]:

```python
my_custom_metric = SampleLevelMetric(
    metric_name="custom_accuracy",
    higher_is_better=True,
    category=SamplingMethod.GENERATIVE,
    sample_level_fn=custom_metric,
    corpus_level_fn=agg_function,
)
```

#### Multiple Metrics

If your metric defines multiple metrics per sample, you can use the following code
with [`~metrics.utils.metric_utils.SampleLevelMetricGrouping`]:

```python
custom_metric = SampleLevelMetricGrouping(
    metric_name=["accuracy", "response_length", "confidence"],
    higher_is_better={
        "accuracy": True,
        "response_length": False,  # Shorter responses might be better
        "confidence": True
    },
    category=SamplingMethod.GENERATIVE,
    sample_level_fn=custom_metric,
    corpus_level_fn={
        "accuracy": np.mean,
        "response_length": np.mean,
        "confidence": np.mean,
    },
)
```

### Step 5: Register the Metric

To finish, add the following code so that it adds your metric to our metrics list
when loaded as a module:

```python
# Adds the metric to the metric list!
extend_enum(Metrics, "CUSTOM_ACCURACY", my_custom_metric)

if __name__ == "__main__":
    print("Imported metric")
```

## Using Your Custom Metric

### With Custom Tasks

You can then give your custom metric to Lighteval by using `--custom-tasks
path_to_your_file` when launching it after adding it to the task config.

```bash
lighteval accelerate \
    "model_name=openai-community/gpt2" \
    "truthfulqa:mc" \
    --custom-tasks path_to_your_metric_file.py
```

```python
from lighteval.tasks.lighteval_task import LightevalTaskConfig

task = LightevalTaskConfig(
    name="my_custom_task",
    metric=[my_custom_metric],  # Use your custom metric here
    prompt_function=my_prompt_function,
    hf_repo="my_dataset",
    evaluation_splits=["test"]
)
```
